Abstract

With rising economic pressures and daily financial demands, the demand for loans continues to grow. Financial institutions receive large volumes of loan applications each day, which are difficult to manage when evaluations are performed manually and no reliable methods exist to assess a candidate’s creditworthiness (Mnkandla et al., 2024). To simplify the process for both applicants and lenders, robust evaluation methods are needed that can accurately predict an applicant’s creditworthiness while minimizing risk for lenders. To that end, we explore an existing dataset to identify patterns across the features of loan applicants and determine which factors have the greatest impact on loan approvals. Once these key features are identified, predictive models can be built around them, reducing the need for manual evaluation and enabling data-driven decision-making based on actual application data.

Introduction

To investigate the factors that influence loan decisions, we obtained a dataset from Kaggle (Sharma, 2023) containing variables for each loan application, such as income, education, CIBIL score, and the final loan status (Approved or Rejected). Our goal is to examine whether each of these variables affects the loan decision. We formulated roughly ten SMART questions that explore the potential effect of each variable on the final outcome. The insights gained from this exploratory analysis will guide the development of predictive models capable of accurately determining whether a loan should be approved for a given applicant.

Loading the Dataset

# reading dataset
loan_df <- data.frame(read.csv("loan_approval_dataset.csv"))
# exploring dataset
str(loan_df)
## 'data.frame':    4269 obs. of  13 variables:
##  $ loan_id                 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ no_of_dependents        : int  2 0 3 3 5 0 5 2 0 5 ...
##  $ education               : chr  " Graduate" " Not Graduate" " Graduate" " Graduate" ...
##  $ self_employed           : chr  " No" " Yes" " No" " No" ...
##  $ income_annum            : int  9600000 4100000 9100000 8200000 9800000 4800000 8700000 5700000 800000 1100000 ...
##  $ loan_amount             : int  29900000 12200000 29700000 30700000 24200000 13500000 33000000 15000000 2200000 4300000 ...
##  $ loan_term               : int  12 8 20 8 20 10 4 20 20 10 ...
##  $ cibil_score             : int  778 417 506 467 382 319 678 382 782 388 ...
##  $ residential_assets_value: int  2400000 2700000 7100000 18200000 12400000 6800000 22500000 13200000 1300000 3200000 ...
##  $ commercial_assets_value : int  17600000 2200000 4500000 3300000 8200000 8300000 14800000 5700000 800000 1400000 ...
##  $ luxury_assets_value     : int  22700000 8800000 33300000 23300000 29400000 13700000 29200000 11800000 2800000 3300000 ...
##  $ bank_asset_value        : int  8000000 3300000 12800000 7900000 5000000 5100000 4300000 6000000 600000 1600000 ...
##  $ loan_status             : chr  " Approved" " Rejected" " Rejected" " Rejected" ...

The dataset has 4,269 observations of 13 variables.

Data Cleaning & Summary

Subsetting data and converting categorical variables to factors

We remove the loan_id column, since it only serves as a unique identifier and does not contribute to the analysis. We also convert several variables to factors, as they represent categorical information rather than continuous numerical values.

loan_df <- subset(loan_df, select = -c(loan_id))

# converting the numeric variables to factor variables
loan_df$no_of_dependents = as.factor(loan_df$no_of_dependents)
loan_df$education = as.factor(loan_df$education)
loan_df$self_employed = as.factor(loan_df$self_employed)
loan_df$loan_status = as.factor(loan_df$loan_status)
str(loan_df)
## 'data.frame':    4269 obs. of  12 variables:
##  $ no_of_dependents        : Factor w/ 6 levels "0","1","2","3",..: 3 1 4 4 6 1 6 3 1 6 ...
##  $ education               : Factor w/ 2 levels " Graduate"," Not Graduate": 1 2 1 1 2 1 1 1 1 2 ...
##  $ self_employed           : Factor w/ 2 levels " No"," Yes": 1 2 1 1 2 2 1 2 2 1 ...
##  $ income_annum            : int  9600000 4100000 9100000 8200000 9800000 4800000 8700000 5700000 800000 1100000 ...
##  $ loan_amount             : int  29900000 12200000 29700000 30700000 24200000 13500000 33000000 15000000 2200000 4300000 ...
##  $ loan_term               : int  12 8 20 8 20 10 4 20 20 10 ...
##  $ cibil_score             : int  778 417 506 467 382 319 678 382 782 388 ...
##  $ residential_assets_value: int  2400000 2700000 7100000 18200000 12400000 6800000 22500000 13200000 1300000 3200000 ...
##  $ commercial_assets_value : int  17600000 2200000 4500000 3300000 8200000 8300000 14800000 5700000 800000 1400000 ...
##  $ luxury_assets_value     : int  22700000 8800000 33300000 23300000 29400000 13700000 29200000 11800000 2800000 3300000 ...
##  $ bank_asset_value        : int  8000000 3300000 12800000 7900000 5000000 5100000 4300000 6000000 600000 1600000 ...
##  $ loan_status             : Factor w/ 2 levels " Approved"," Rejected": 1 2 2 2 2 2 1 2 1 2 ...

After conversion, we have the following variables in the dataset:

Variables in dataset

library(knitr)

# Data dictionary describing each variable
loan_data <- data.frame(
  Variable = c("no_of_dependents", "education", "self_employed", "income_annum", "loan_amount", "loan_term", "cibil_score", "residential_assets_value", "commercial_assets_value", "luxury_assets_value", "bank_asset_value", "loan_status"),
  Description = c("Number of dependents an applicant has ranging from 0 to 5", "Education level (Graduate or Not Graduate)", "Whether the applicant works independently or for an employer", "Annual income of applicant", "Amount of loan requested", "The number of years in which the applicant will repay the loan","Credit score indicating applicants history of repayment", "Value of applicant's residential assets if any", "Value of applicant's commercial assets if any", "Value of applicant's luxury assets if any", "Value of applicant's bank assets if any", "Target variable indicating whether the loan was Approved or Rejected"),
  Type = c("Categorical", "Categorical", "Categorical", "Numeric", "Numeric", "Numeric", "Numeric", "Numeric", "Numeric", "Numeric", "Numeric", "Categorical")
)

# Create table
kable(loan_data)
Variable Description Type
no_of_dependents Number of dependents an applicant has ranging from 0 to 5 Categorical
education Education level (Graduate or Not Graduate) Categorical
self_employed Whether the applicant works independently or for an employer Categorical
income_annum Annual income of applicant Numeric
loan_amount Amount of loan requested Numeric
loan_term The number of years in which the applicant will repay the loan Numeric
cibil_score Credit score indicating applicants history of repayment Numeric
residential_assets_value Value of applicant’s residential assets if any Numeric
commercial_assets_value Value of applicant’s commercial assets if any Numeric
luxury_assets_value Value of applicant’s luxury assets if any Numeric
bank_asset_value Value of applicant’s bank assets if any Numeric
loan_status Target variable indicating whether the loan was Approved or Rejected Categorical

Finding NA values

sum(is.na(loan_df))
## [1] 0

There are no NA values in the dataset.

Summary of the dataset

summary(loan_df)
##  no_of_dependents         education    self_employed  income_annum    
##  0:712             Graduate    :2144    No :2119     Min.   : 200000  
##  1:697             Not Graduate:2125    Yes:2150     1st Qu.:2700000  
##  2:708                                               Median :5100000  
##  3:727                                               Mean   :5059124  
##  4:752                                               3rd Qu.:7500000  
##  5:673                                               Max.   :9900000  
##   loan_amount         loan_term     cibil_score  residential_assets_value
##  Min.   :  300000   Min.   : 2.0   Min.   :300   Min.   : -100000        
##  1st Qu.: 7700000   1st Qu.: 6.0   1st Qu.:453   1st Qu.: 2200000        
##  Median :14500000   Median :10.0   Median :600   Median : 5600000        
##  Mean   :15133450   Mean   :10.9   Mean   :600   Mean   : 7472617        
##  3rd Qu.:21500000   3rd Qu.:16.0   3rd Qu.:748   3rd Qu.:11300000        
##  Max.   :39500000   Max.   :20.0   Max.   :900   Max.   :29100000        
##  commercial_assets_value luxury_assets_value bank_asset_value  
##  Min.   :       0        Min.   :  300000    Min.   :       0  
##  1st Qu.: 1300000        1st Qu.: 7500000    1st Qu.: 2300000  
##  Median : 3700000        Median :14600000    Median : 4600000  
##  Mean   : 4973155        Mean   :15126306    Mean   : 4976692  
##  3rd Qu.: 7600000        3rd Qu.:21700000    3rd Qu.: 7100000  
##  Max.   :19400000        Max.   :39200000    Max.   :14700000  
##     loan_status  
##   Approved:2656  
##   Rejected:1613  
##                  
##                  
##                  
## 

Overall, the dataset appears well structured and balanced. Categorical variables such as education, self_employed, and loan_status show nearly even distributions across their categories. Most numerical variables fall within reasonable ranges, with average loan terms around 11 years and CIBIL scores centered near 600, both indicating realistic applicant profiles. However, some financial variables, such as income, loan amount, and asset values, show wide variation and possible outliers. In particular, the negative values in residential_assets_value point to data quality issues that need correction before exploration or testing.
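To get a rough sense of how many extreme values those financial variables contain, a quick count based on the 1.5×IQR rule can be run before any cleaning (a sketch, assuming loan_df as loaded above; the 1.5×IQR heuristic is an addition here, not part of the original analysis):

```r
# Count values falling outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] per column
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}

sapply(loan_df[, c("income_annum", "loan_amount",
                   "residential_assets_value", "commercial_assets_value",
                   "luxury_assets_value", "bank_asset_value")],
       count_outliers)
```

Columns with large counts here are the ones whose skew is worth keeping in mind when choosing tests later.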

Cleaning up negative values

Checking the number of rows that have negative values for residential_assets_value column

sum(loan_df$residential_assets_value < 0)
## [1] 28

There are 28 entries with negative values.

Removing entries with negative values

loan_df <- loan_df[loan_df$residential_assets_value >= 0, ]
str(loan_df)
## 'data.frame':    4241 obs. of  12 variables:
##  $ no_of_dependents        : Factor w/ 6 levels "0","1","2","3",..: 3 1 4 4 6 1 6 3 1 6 ...
##  $ education               : Factor w/ 2 levels " Graduate"," Not Graduate": 1 2 1 1 2 1 1 1 1 2 ...
##  $ self_employed           : Factor w/ 2 levels " No"," Yes": 1 2 1 1 2 2 1 2 2 1 ...
##  $ income_annum            : int  9600000 4100000 9100000 8200000 9800000 4800000 8700000 5700000 800000 1100000 ...
##  $ loan_amount             : int  29900000 12200000 29700000 30700000 24200000 13500000 33000000 15000000 2200000 4300000 ...
##  $ loan_term               : int  12 8 20 8 20 10 4 20 20 10 ...
##  $ cibil_score             : int  778 417 506 467 382 319 678 382 782 388 ...
##  $ residential_assets_value: int  2400000 2700000 7100000 18200000 12400000 6800000 22500000 13200000 1300000 3200000 ...
##  $ commercial_assets_value : int  17600000 2200000 4500000 3300000 8200000 8300000 14800000 5700000 800000 1400000 ...
##  $ luxury_assets_value     : int  22700000 8800000 33300000 23300000 29400000 13700000 29200000 11800000 2800000 3300000 ...
##  $ bank_asset_value        : int  8000000 3300000 12800000 7900000 5000000 5100000 4300000 6000000 600000 1600000 ...
##  $ loan_status             : Factor w/ 2 levels " Approved"," Rejected": 1 2 2 2 2 2 1 2 1 2 ...
summary(loan_df)
##  no_of_dependents         education    self_employed  income_annum    
##  0:706             Graduate    :2127    No :2106     Min.   : 200000  
##  1:696             Not Graduate:2114    Yes:2135     1st Qu.:2700000  
##  2:701                                               Median :5100000  
##  3:725                                               Mean   :5074251  
##  4:744                                               3rd Qu.:7500000  
##  5:669                                               Max.   :9900000  
##   loan_amount         loan_term     cibil_score  residential_assets_value
##  Min.   :  300000   Min.   : 2.0   Min.   :300   Min.   :       0        
##  1st Qu.: 7700000   1st Qu.: 6.0   1st Qu.:453   1st Qu.: 2200000        
##  Median :14600000   Median :10.0   Median :600   Median : 5700000        
##  Mean   :15178401   Mean   :10.9   Mean   :600   Mean   : 7522613        
##  3rd Qu.:21500000   3rd Qu.:16.0   3rd Qu.:747   3rd Qu.:11400000        
##  Max.   :39500000   Max.   :20.0   Max.   :900   Max.   :29100000        
##  commercial_assets_value luxury_assets_value bank_asset_value  
##  Min.   :       0        Min.   :  300000    Min.   :       0  
##  1st Qu.: 1300000        1st Qu.: 7500000    1st Qu.: 2400000  
##  Median : 3700000        Median :14600000    Median : 4600000  
##  Mean   : 4985121        Mean   :15171210    Mean   : 4991488  
##  3rd Qu.: 7700000        3rd Qu.:21700000    3rd Qu.: 7100000  
##  Max.   :19400000        Max.   :39200000    Max.   :14700000  
##     loan_status  
##   Approved:2640  
##   Rejected:1601  
##                  
##                  
##                  
## 

After filtering out the rows with negative values, the dataset has 4,241 observations. The distributions across all variables look almost identical to before, indicating that this cleaning step did not materially change the data overall.

Exploratory Data Analysis

Does education level affect loan approval?

Here, we aim to examine whether the person’s education (Graduated or Not Graduated) affects the approval or rejection of a loan. We begin by visualizing the distribution of approved and rejected applications for each education group using a bar chart.

library(ggplot2)
ggplot(loan_df, aes(x = loan_status, fill = education)) +
  geom_bar(position = "dodge") +
  labs(
    title = "Loan Approval by Education Level",
    x = "Loan Status",
    y = "Number of Applicants"
  ) +
  scale_fill_manual(
    values = c(" Graduate" = "#FF7C61", " Not Graduate" = "#50E5C8")
  ) +
  theme_minimal()

Approval rates are nearly identical for Graduate and Not Graduate applicants, suggesting that education likely does not affect loan approval. A chi-squared test can assess this formally.

Hypotheses:

  • Null hypothesis (H₀): Loan approval is independent of education level (Graduate vs. Not Graduate).
  • Alternative hypothesis (H₁): Loan approval depends on education level; there is an association between education and loan approval.
edu_table <- table(loan_df$education, loan_df$loan_status)
print(edu_table)
##                
##                  Approved  Rejected
##    Graduate          1329       798
##    Not Graduate      1311       803
edu_chi <- chisq.test(edu_table)
print(edu_chi)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  edu_table
## X-squared = 0.08, df = 1, p-value = 0.8

Conclusion

The Chi-square test yielded a p-value of 0.8. Since this is greater than 0.05, we fail to reject the null hypothesis, indicating that there is no significant association between education level and loan approval. This confirms that, in this dataset, education does not appear to affect the likelihood of a loan being approved.
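The near-identical approval rates can also be expressed directly as row proportions (a sketch, assuming loan_df as prepared above; note that the factor levels retain a leading space from the raw CSV):

```r
# Row-wise approval/rejection proportions for each education level
edu_table <- table(loan_df$education, loan_df$loan_status)
round(prop.table(edu_table, margin = 1), 3)
```

If the two rows differ by only a fraction of a percentage point, that matches the non-significant chi-squared result.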

Do people with higher asset values tend to get approved more often?

Here, we aim to examine whether the value of a person’s assets, including residential, commercial, luxury, and bank assets, individually affects the approval or rejection of a loan. We begin by visualizing the frequency distribution of each type of asset value.

# 1. Residential Assets Histogram
hist(loan_df$residential_assets_value,
     main = "Distribution of Residential Assets Value",
     xlab = "Residential Assets Value",
     col = "lightblue",
     border = "black")

# 2. Commercial Assets Histogram
hist(loan_df$commercial_assets_value,
     main = "Distribution of Commercial Assets Value",
     xlab = "Commercial Assets Value",
     col = "lightgreen",
     border = "black")

# 3. Luxury Assets Histogram
hist(loan_df$luxury_assets_value,
     main = "Distribution of Luxury Assets Value",
     xlab = "Luxury Assets Value",
     col = "pink",
     border = "black")

# 4. Bank Assets Histogram
hist(loan_df$bank_asset_value,
     main = "Distribution of Bank Assets Value",
     xlab = "Bank Assets Value",
     col = "salmon",
     border = "black")

# Reset plotting layout to default (1 plot per screen) after generation
par(mfrow = c(1, 1))

Each type of asset value shows a right-skewed distribution, meaning most individuals hold lower asset values while a smaller number possess much higher ones, indicating non-normality. To statistically verify this, we can apply the Shapiro–Wilk test, which checks whether the data follow a normal distribution.

Hypotheses:

  • Null hypothesis (H₀): The data are normally distributed.
  • Alternative hypothesis (H₁): The data are not normally distributed.

We use this test to confirm whether the visual observation of skewness is statistically significant.
shapiro.test(loan_df$residential_assets_value)
## 
##  Shapiro-Wilk normality test
## 
## data:  loan_df$residential_assets_value
## W = 0.9, p-value <2e-16
shapiro.test(loan_df$commercial_assets_value)
## 
##  Shapiro-Wilk normality test
## 
## data:  loan_df$commercial_assets_value
## W = 0.9, p-value <2e-16
shapiro.test(loan_df$luxury_assets_value)
## 
##  Shapiro-Wilk normality test
## 
## data:  loan_df$luxury_assets_value
## W = 1, p-value <2e-16
shapiro.test(loan_df$bank_asset_value)
## 
##  Shapiro-Wilk normality test
## 
## data:  loan_df$bank_asset_value
## W = 1, p-value <2e-16

Since all p-values are far below 0.05, we reject the null hypothesis for each test. This indicates that none of the asset value distributions are normally distributed, which is consistent with the right-skewed patterns observed in the histograms. Let’s visualize the impact of asset values on loan status using a box plot.

# Convert to long format (dplyr provides %>%, tidyr provides pivot_longer)
library(dplyr)
library(tidyr)

loan_long <- loan_df %>%
  pivot_longer(
    cols = c(residential_assets_value, commercial_assets_value, luxury_assets_value, bank_asset_value),
    names_to = "asset_type",
    values_to = "asset_value"
  )

# Boxplots all in one figure
ggplot(loan_long, aes(x = asset_type, y = asset_value, fill = loan_status)) +
  geom_boxplot(position = position_dodge(width = 0.8)) +
  labs(
    title = "Asset Distributions by Loan Status",
    x = "Asset Type",
    y = "Asset Value"
  ) +
  theme_minimal() +
  scale_fill_manual(values = c(" Approved" = "#1b9e77", " Rejected" = "#d95f02"))

The median and range of each asset type are similar between Approved and Rejected loans, suggesting little apparent effect on loan decisions. Since the Shapiro-Wilk test showed that asset values are not normally distributed, we use the Wilcoxon rank-sum test to formally assess differences.

Hypotheses for each asset type:

  • Null hypothesis (H₀): The distribution of asset values is the same for Approved and Rejected loans.
  • Alternative hypothesis (H₁): The distribution of asset values differs between Approved and Rejected loans.
wilcox.test(residential_assets_value ~ loan_status, data = loan_df)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  residential_assets_value by loan_status
## W = 2e+06, p-value = 0.2
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(commercial_assets_value ~ loan_status, data = loan_df)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  commercial_assets_value by loan_status
## W = 2e+06, p-value = 0.7
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(luxury_assets_value ~ loan_status, data = loan_df)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  luxury_assets_value by loan_status
## W = 2e+06, p-value = 0.2
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(bank_asset_value ~ loan_status, data = loan_df)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  bank_asset_value by loan_status
## W = 2e+06, p-value = 0.4
## alternative hypothesis: true location shift is not equal to 0

Conclusion

All p-values are greater than 0.05, so we fail to reject the null hypothesis for each asset type. This indicates that there is no statistically significant difference in asset values between Approved and Rejected loans, supporting the initial observation that asset values do not appear to affect loan approval decisions.
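The similarity between the two groups can be summarized numerically with group medians (a sketch, assuming loan_df as prepared above):

```r
# Median value of each asset type, split by loan status
aggregate(cbind(residential_assets_value, commercial_assets_value,
                luxury_assets_value, bank_asset_value) ~ loan_status,
          data = loan_df, FUN = median)
```

Medians are preferred over means here because the asset distributions are right-skewed.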

Is there a significant difference in average income between approved and rejected applicants?

Here, we aim to examine whether the person’s annual income affects the approval or rejection of a loan. We begin by visualizing the frequency distribution of annual income.

hist(loan_df$income_annum,
     main = "Distribution of Annual income",
     xlab = "Annual income",
     col = "salmon",
     border = "black")

The frequency distribution shows that annual income is roughly uniform and symmetric, with no noticeable skew or extreme outliers. We can now examine the distribution of annual income for Approved and Rejected applicants using a box plot.

ggplot(loan_df, aes(x = loan_status, y = income_annum, fill = loan_status)) +
  geom_boxplot() +
  labs(title = "Annual Income Distribution by Loan Status", x = "Loan Status", y = "Annual Income") +
  scale_fill_manual(values = c(" Approved" = "#12362A", " Rejected" = "#8DD9C1"))

The box plot shows that the income distributions for both Approved and Rejected applicants are quite similar. The median income for both groups is approximately 5 million, and the small difference in the lower quartiles appears negligible. To statistically verify this observation, we will perform a two-sample t-test to compare the means of the two groups.

  • Null Hypothesis (H₀): There is no significant difference in the average annual income between approved and rejected applicants.

  • Alternative Hypothesis (H₁): There is a significant difference in the average annual income between approved and rejected applicants.

approved_income <- loan_df$income_annum[loan_df$loan_status == " Approved"]
rejected_income <- loan_df$income_annum[loan_df$loan_status == " Rejected"]
t.test(approved_income, rejected_income)
## 
##  Welch Two Sample t-test
## 
## data:  approved_income and rejected_income
## t = -1, df = 3435, p-value = 0.2
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -277421   68867
## sample estimates:
## mean of x mean of y 
##   5034886   5139163

Conclusion

Since the p-value (0.2) is greater than 0.05, we fail to reject the null hypothesis, indicating that there is no statistically significant difference in mean income between Approved and Rejected applicants. The data therefore supports the earlier observation from the box plot: annual income does not appear to affect loan approval in this dataset.
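As a supplementary check on the practical size of this (non-)difference, a standardized mean difference can be computed by hand (a sketch, assuming loan_df as prepared above; Cohen's d is an addition here, not part of the original analysis):

```r
# Cohen's d: mean difference scaled by the pooled standard deviation
approved_income <- loan_df$income_annum[trimws(loan_df$loan_status) == "Approved"]
rejected_income <- loan_df$income_annum[trimws(loan_df$loan_status) == "Rejected"]

n1 <- length(approved_income)
n2 <- length(rejected_income)
pooled_sd <- sqrt(((n1 - 1) * var(approved_income) +
                   (n2 - 1) * var(rejected_income)) / (n1 + n2 - 2))
(mean(approved_income) - mean(rejected_income)) / pooled_sd
```

A |d| well below 0.2 would indicate a negligible effect, consistent with the t-test result.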

Is there a correlation between applicant’s annual income and loan amount requested?

Here, we aim to examine whether the loan amount requested is influenced by an applicant’s annual income. Understanding this relationship helps identify potential multicollinearity. Since we already know that annual income is roughly uniform and symmetric with no skew, we now visualize the distribution of the loan amount requested.

hist(loan_df$loan_amount,
     main = "Distribution of Loan amount requested",
     xlab = "Loan amount requested",
     col = "yellow",
     border = "black")

The distribution of loan amount is right-skewed and deviates from normality. Therefore, we will first visualize the relationship between annual income and the requested loan amount, and then use Spearman’s correlation test, as it does not assume normality.

annual_income <- loan_df$income_annum
loan_amount <- loan_df$loan_amount

ggplot(data = loan_df, aes(x = loan_amount, y = income_annum)) +
  geom_point(color = "#F7AC19") +
  labs(
    title = "Scatter plot of Annual Income vs Loan Amount",
    x = "Loan Amount",
    y = "Annual Income"
  ) +
  theme_minimal()

A positive correlation is evident, indicating that applicants with higher annual incomes tend to apply for and receive larger loan amounts. We can use a Spearman correlation test to verify this.

  • Null Hypothesis (H₀): There is no relationship between annual income and loan amount among applicants.

  • Alternative Hypothesis (H₁): Applicants with higher annual incomes tend to apply for and receive larger loan amounts.

cor.test(annual_income, loan_amount, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  annual_income and loan_amount
## S = 8e+08, p-value <2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##   rho 
## 0.941

Conclusion:

The Spearman’s rank correlation test shows a very strong positive correlation between annual income and loan amount (ρ = 0.941, p < 0.001). This indicates that as applicants’ annual income increases, the loan amount they apply for and receive also tends to increase. Since the p-value is far below 0.05, we reject the null hypothesis and conclude that there is a significant positive relationship between annual income and loan amount. This strong association suggests that the two variables convey similar information.
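To flag other potentially collinear pairs beyond income and loan amount, a rank-correlation matrix over all numeric predictors can be inspected (a sketch, assuming loan_df as prepared above):

```r
# Spearman correlation matrix for the numeric predictors
num_cols <- c("income_annum", "loan_amount", "loan_term", "cibil_score",
              "residential_assets_value", "commercial_assets_value",
              "luxury_assets_value", "bank_asset_value")
round(cor(loan_df[, num_cols], method = "spearman"), 2)
```

Any pair with |rho| near 0.9, like income and loan amount, is a candidate for dropping or combining before modeling.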

Is there a significant difference in average CIBIL scores between approved and rejected applicants?

Here, we are investigating whether there is a significant difference in CIBIL scores between Approved and Rejected candidates. First, we plot the entire distribution to identify outliers; then we plot a chart comparing the Approved and Rejected populations.

library(ggplot2)
#CIBIL score frequency distribution
hist(loan_df$cibil_score, main = "Frequency Distribution of CIBIL Scores", xlab = "CIBIL Score", ylab = "Frequency", breaks = 15, col = "#197571")

#CIBIL score frequency box chart
ggplot(loan_df, aes(x = loan_status, y = cibil_score, fill = loan_status)) +
  geom_boxplot() +
  labs(title = "CIBIL Score Distribution by Loan Status", x = "Loan Status", y = "CIBIL Score") +
  scale_fill_manual(values = c(" Approved" = "#197571", " Rejected" = "#FFCF85"))

#t-test
t.test(cibil_score ~ loan_status, data = loan_df)
## 
##  Welch Two Sample t-test
## 
## data:  cibil_score by loan_status
## t = 88, df = 4238, p-value <2e-16
## alternative hypothesis: true difference in means between group  Approved and group  Rejected is not equal to 0
## 95 percent confidence interval:
##  268 280
## sample estimates:
## mean in group  Approved mean in group  Rejected 
##                     703                     429

Hypotheses:

  • Null hypothesis (H₀): The average CIBIL Scores for Approved and Rejected applicants are equal.
  • Alternative hypothesis (H₁): The average CIBIL Scores for Approved and Rejected applicants are not equal.

We use this test to confirm whether the CIBIL Scores for the Approved and Rejected applicants are statistically significantly different or the same.

Conclusion

The overall CIBIL score frequency distribution shows no apparent outliers. The individual distributions for approved and rejected loans show a few outliers, but it is not necessary to remove them. Even though the outliers are dragging the two means closer to each other, the means are still significantly different even when including them in the t-test.

There is a significant difference in CIBIL scores between approved and rejected applicants because the p-value, 2e-16, is much lower than the standard alpha threshold of 0.05. This allows us to reject the null hypothesis that the two means of approved and rejected applicants are equal.
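The size of this effect can be made concrete by binning CIBIL scores and computing the approval rate per band (a sketch, assuming loan_df as prepared above; the band edges are illustrative, not part of the original analysis):

```r
# Approval rate within illustrative CIBIL score bands
cibil_band <- cut(loan_df$cibil_score,
                  breaks = c(300, 450, 550, 650, 750, 900),
                  include.lowest = TRUE)
tapply(trimws(loan_df$loan_status) == "Approved", cibil_band, mean)
```

A steep climb in approval rate across the bands would mirror the large mean gap found by the t-test.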

For rejected loans, is there a significant difference in the mean CIBIL score between applicants with shorter-term loans (<= 10 years) versus longer-term loans (> 10 years)?

Here, we are investigating whether there is a significant difference in CIBIL scores within the Rejected applicant population depending on whether the requested loan term is short (less than or equal to 10 years) or long (greater than 10 years). First, we plot the overall loan-term distribution to identify outliers; then we plot a chart comparing the Rejected applicants with shorter and longer term loans.

#created subsets for shorter and longer loans then ran t-test
rejected_shorter <- subset(loan_df, trimws(loan_status) == "Rejected" & loan_term <= 10)
rejected_longer <- subset(loan_df, trimws(loan_status) == "Rejected" & loan_term > 10)

#Overall loan term frequency distribution
ggplot(loan_df, aes(x = factor(loan_term))) +
  geom_bar(fill = "#7B3D91") +
  labs(title = "Frequency Distribution of Loan Terms", x = "Loan Term (Years)", y = "Frequency")

#boxplot
boxplot(
  rejected_shorter$cibil_score,
  rejected_longer$cibil_score,
  names = c("Shorter (<= 10 yrs)", "Longer (> 10 yrs)"),
  main = "CIBIL Score Distribution for Rejected Loans",
  ylab = "CIBIL Score",
  xlab = "Loan Term Group",
  col = c("#A769C2", "#87BDCC")
)

#removing outliers
removeOutliers <- function(x) {
  Q1 <- quantile(x, 0.25)
  Q3 <- quantile(x, 0.75)
  IQR_val <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR_val
  upper_bound <- Q3 + 1.5 * IQR_val
  x[x >= lower_bound & x <= upper_bound]
}

cleaned_shorter_cibil <- removeOutliers(rejected_shorter$cibil_score)
cleaned_longer_cibil <- removeOutliers(rejected_longer$cibil_score)

t.test(rejected_shorter$cibil_score, rejected_longer$cibil_score)
## 
##  Welch Two Sample t-test
## 
## data:  rejected_shorter$cibil_score and rejected_longer$cibil_score
## t = -0.03, df = 1564, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.72  7.51
## sample estimates:
## mean of x mean of y 
##       429       429
t.test(cleaned_shorter_cibil, cleaned_longer_cibil)
## 
##  Welch Two Sample t-test
## 
## data:  cleaned_shorter_cibil and cleaned_longer_cibil
## t = -0.05, df = 1563, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.47  7.09
## sample estimates:
## mean of x mean of y 
##       427       428

Hypothesis:

  • Null hypothesis (H₀): The average CIBIL Scores of Rejected applicants with shorter loan terms (less than or equal to 10 years) is equal to that of Rejected applicants with longer loan term (greater than 10 years).
  • Alternative hypothesis (H₁): The average CIBIL Scores of Rejected applicants with shorter loan terms (less than or equal to 10 years) is not equal to that of Rejected applicants with longer loan term (greater than 10 years).

We use this test to confirm whether the CIBIL Scores for the Rejected applicants with shorter loan terms and longer loan terms are statistically significantly different or the same.

Conclusion

The overall loan-term frequency distribution shows no apparent outliers, though there are a few in the shorter and longer loan-term groups. That said, even after removing these outliers, no significant difference was found between the two groups’ CIBIL scores.

There is no significant difference in CIBIL scores between shorter (<= 10 years) and longer (> 10 years) term loans within the rejected group of applicants, because the p-value of 1 is far above the standard alpha threshold of 0.05. We therefore fail to reject the null hypothesis that the mean CIBIL scores of the two groups are equal.

Is there a correlation between the CIBIL Score and the Loan Term among those whose loans are approved?

Here, we are investigating whether there is a correlation between CIBIL Score and Loan Term (Years) among Approved applicants. A correlation test is conducted to determine the correlation coefficient and p-value, and a scatter plot shows the relationship between the two variables.

#correlation coefficient
corr_r <- cor(loan_df[trimws(loan_df$loan_status) == "Approved", "cibil_score"], loan_df[trimws(loan_df$loan_status) == "Approved", "loan_term"], method = "pearson")

#correlation test
cor.test(loan_df[trimws(loan_df$loan_status) == "Approved", "cibil_score"], loan_df[trimws(loan_df$loan_status) == "Approved", "loan_term"], method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  loan_df[trimws(loan_df$loan_status) == "Approved", "cibil_score"] and loan_df[trimws(loan_df$loan_status) == "Approved", "loan_term"]
## t = 11, df = 2638, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.173 0.246
## sample estimates:
##  cor 
## 0.21
# scatter plot
ggplot(subset(loan_df, trimws(loan_status) == "Approved"),
       aes(x = loan_term, y = cibil_score)) +
  geom_point(alpha = 0.4, color = "#AAE5D8") +
  geom_smooth(method = "lm", se = FALSE, color = "#061411") +
  labs(
    title = "CIBIL Score vs. Loan Term for Approved Loans",
    x = "Loan Term (Years)",
    y = "CIBIL Score"
  )

Hypothesis:

  • Null hypothesis (H₀): There is no correlation between CIBIL Score and Loan Term (Years) of Approved applicants.
  • Alternative hypothesis (H₁): There is a correlation between CIBIL Score and Loan Term (Years) of Approved applicants.

We use this test to confirm whether there is a correlation between CIBIL Scores and loan terms for the applicants who were Approved for their loan.

Conclusion

The correlation coefficient of 0.21 indicates a weak but positive relationship: approved applicants with higher CIBIL scores have a slight tendency to be approved for longer loan terms. The p-value is very small (< 2e-16), meaning the correlation is unlikely to be due to chance.

Does self-employment status affect loan approval?

Here, we aim to examine whether the person’s employment status affects the approval or rejection of a loan. We begin by visualizing the distribution of approved and rejected applications for each employment group using a bar chart.

ggplot(loan_df, aes(x = self_employed, fill = loan_status)) +
  geom_bar(position = "dodge") +
  labs(
    title = "Loan Approval by self employment status",
    x = "self employment status",
    y = "Number of Applicants"
  ) +
  scale_fill_manual(values = c(" Approved" = "#F2764E", " Rejected" = "#FAF487")) +
  theme_minimal()

Approval rates are nearly identical for self-employed and non self-employed applicants, suggesting self-employment likely doesn’t affect loan approval. A Chi-square test will confirm this statistically.

Hypotheses:

  • Null hypothesis (H₀): Loan approval is independent of self-employment status.
  • Alternative hypothesis (H₁): Loan approval depends on self-employment status; there is an association between self-employment status and loan approval.
contable <- table(loan_df$self_employed, loan_df$loan_status)
chisq.test(contable)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contable
## X-squared = 0.009, df = 1, p-value = 0.9

Conclusion

The Chi-square test (p = 0.9) also shows no significant association between self-employment status and loan approval, suggesting that self-employment does not affect the likelihood of loan approval in this dataset.

Does the number of dependents affect approval of loans?

Here, we will see whether the number of dependents affects the approval or rejection of a loan. We begin by visualizing the distribution of approved and rejected applications for each dependents group using a bar chart.

ggplot(loan_df, aes(x = no_of_dependents, fill = loan_status)) +
  geom_bar(position = "dodge") +
  labs(
    title = "Loan Approval by Number of Dependents",
    x = "Number of Dependents",
    y = "Number of Applicants"
  ) +
  scale_fill_manual(values = c(" Approved" = "#0D0942", " Rejected" = "#C4C0F6")) +
  theme_minimal()

The bar chart shows that approval and rejection counts are fairly similar across all dependent categories, with no noticeable trend suggesting that the number of dependents strongly influences loan approval outcomes. While minor variations exist, the approval rate appears consistent across groups, indicating that loan approval is likely independent of the number of dependents. A Chi-square test will confirm this statistically.

Hypotheses:

  • Null hypothesis (H₀): Loan approval is independent of the number of dependents.
  • Alternative hypothesis (H₁): Loan approval depends on the number of dependents; there is an association between the number of dependents and loan approval.
contable <- table(loan_df$no_of_dependents, loan_df$loan_status)
chisq.test(contable)
## 
##  Pearson's Chi-squared test
## 
## data:  contable
## X-squared = 2, df = 5, p-value = 0.8

Conclusion

The Chi-square test (p = 0.8) also shows no significant association between the number of dependents and loan approval, suggesting that the number of dependents does not affect the likelihood of loan approval in this dataset.

Does requested loan term affect approval?

The goal is to see whether the loan term affects loan approval. Having already examined the distribution of loan terms, we now use a box plot to visualize whether certain loan terms have a greater chance of approval than others.

ggplot(loan_df, aes(x = loan_status, y = loan_term, fill = loan_status)) +
  geom_boxplot(alpha = 0.6) +
  labs(
    title = "Loan Term by Loan Approval Status",
    x = "Loan Status",
    y = "Loan Term (Years)"
  ) +
  scale_fill_manual(values = c(" Approved" = "#98C25F", " Rejected" = "#EB3BA7")) +
  theme_minimal()

The boxplot shows that approved loans have a wider range of terms, including very short loans, while rejected loans generally have longer terms. The median loan term for approved loans is 10 years, compared with 12 years for rejected loans, indicating that rejected loans tend to have slightly longer terms. To verify this statistically, we will perform a t-test with the following null and alternative hypotheses:

  • Null Hypothesis (H₀): There is no significant difference in the mean loan term between approved and rejected loans.
  • Alternate Hypothesis (H₁): There is a difference in the mean loan term between approved and rejected loans.
t.test(loan_term ~ loan_status, data = loan_df)
## 
##  Welch Two Sample t-test
## 
## data:  loan_term by loan_status
## t = -8, df = 3637, p-value = 2e-14
## alternative hypothesis: true difference in means between group  Approved and group  Rejected is not equal to 0
## 95 percent confidence interval:
##  -1.69 -1.01
## sample estimates:
## mean in group  Approved mean in group  Rejected 
##                    10.4                    11.7

Conclusion

The Welch two-sample t-test indicates a significant difference in means (p < 0.05), so we can reject the null hypothesis: approved loans have a shorter average term (10.4 years) than rejected loans (11.7 years). The 95% confidence interval for the mean difference is -1.69 to -1.01 (the interval does not contain 0), confirming that approved loans tend to have shorter terms.

Overall Findings and Future Work

The exploratory data analysis and hypothesis testing conducted in this study provide several key insights into the factors influencing loan approval outcomes. The results indicate that CIBIL score serves as a significant determinant of approval decisions, with approved applicants demonstrating notably higher average scores than rejected ones. This reinforces the importance of an applicant’s creditworthiness in lending assessments. Additionally, the analysis revealed that loan term plays a critical role, as approved loans tend to have shorter durations, suggesting that lenders may prefer applicants seeking lower-risk, shorter-term loans.

While annual income exhibited a strong positive correlation with the loan amount requested, it did not display a significant direct relationship with loan approval status. This finding implies that although applicants with higher incomes tend to request larger loans, income alone may not substantially affect the likelihood of approval once other financial indicators are considered. Thus, CIBIL score and loan term emerge as the most impactful variables for understanding and predicting loan approval decisions, whereas annual income may serve as a supporting predictor variable rather than a primary determinant.

In contrast, variables such as assets value, loan amount requested, and employment status did not exhibit a notable impact on loan approval outcomes during this exploratory phase. While these features may still add some background information, they appear to play a limited role in influencing the approval decision compared to credit and term-related variables.

Future work should focus on developing predictive models to validate these findings and quantify the relative influence of key variables. Variables such as CIBIL score, loan term, and annual income can be incorporated into classification models such as linear and logistic regression to predict loan approval outcomes.

Overall, this study establishes a foundational understanding of the most influential factors affecting loan approval and provides a data-driven basis for future modeling in credit risk evaluation.

Data Modeling

Q: Can we segment borrowers into distinct groups based on their financial profiles to better understand different applicant types?

Filtering numeric columns

data_kmeans <- loan_df[, c("income_annum", "loan_amount", "cibil_score", "residential_assets_value", "commercial_assets_value", "luxury_assets_value", "bank_asset_value")]
cor(data_kmeans)
##                          income_annum loan_amount cibil_score
## income_annum                   1.0000      0.9271     -0.0235
## loan_amount                    0.9271      1.0000     -0.0175
## cibil_score                   -0.0235     -0.0175      1.0000
## residential_assets_value       0.6363      0.5940     -0.0184
## commercial_assets_value        0.6389      0.6016     -0.0053
## luxury_assets_value            0.9287      0.8599     -0.0294
## bank_asset_value               0.8502      0.7871     -0.0154
##                          residential_assets_value commercial_assets_value
## income_annum                               0.6363                  0.6389
## loan_amount                                0.5940                  0.6016
## cibil_score                               -0.0184                 -0.0053
## residential_assets_value                   1.0000                  0.4146
## commercial_assets_value                    0.4146                  1.0000
## luxury_assets_value                        0.5903                  0.5893
## bank_asset_value                           0.5263                  0.5468
##                          luxury_assets_value bank_asset_value
## income_annum                          0.9287           0.8502
## loan_amount                           0.8599           0.7871
## cibil_score                          -0.0294          -0.0154
## residential_assets_value              0.5903           0.5263
## commercial_assets_value               0.5893           0.5468
## luxury_assets_value                   1.0000           0.7876
## bank_asset_value                      0.7876           1.0000

The correlation matrix shows that all asset variables are strongly and positively related, so we combined them into a single total asset value to avoid redundancy.

str(loan_df)
## 'data.frame':    4241 obs. of  12 variables:
##  $ no_of_dependents        : Factor w/ 6 levels "0","1","2","3",..: 3 1 4 4 6 1 6 3 1 6 ...
##  $ education               : Factor w/ 2 levels " Graduate"," Not Graduate": 1 2 1 1 2 1 1 1 1 2 ...
##  $ self_employed           : Factor w/ 2 levels " No"," Yes": 1 2 1 1 2 2 1 2 2 1 ...
##  $ income_annum            : int  9600000 4100000 9100000 8200000 9800000 4800000 8700000 5700000 800000 1100000 ...
##  $ loan_amount             : int  29900000 12200000 29700000 30700000 24200000 13500000 33000000 15000000 2200000 4300000 ...
##  $ loan_term               : int  12 8 20 8 20 10 4 20 20 10 ...
##  $ cibil_score             : int  778 417 506 467 382 319 678 382 782 388 ...
##  $ residential_assets_value: int  2400000 2700000 7100000 18200000 12400000 6800000 22500000 13200000 1300000 3200000 ...
##  $ commercial_assets_value : int  17600000 2200000 4500000 3300000 8200000 8300000 14800000 5700000 800000 1400000 ...
##  $ luxury_assets_value     : int  22700000 8800000 33300000 23300000 29400000 13700000 29200000 11800000 2800000 3300000 ...
##  $ bank_asset_value        : int  8000000 3300000 12800000 7900000 5000000 5100000 4300000 6000000 600000 1600000 ...
##  $ loan_status             : Factor w/ 2 levels " Approved"," Rejected": 1 2 2 2 2 2 1 2 1 2 ...
loan_df$combined_asset_value <- rowSums(
  loan_df[, c("residential_assets_value",
              "commercial_assets_value",
              "luxury_assets_value",
              "bank_asset_value")],
  na.rm = TRUE
)

data_kmeans <- loan_df[, c("loan_amount", "combined_asset_value", "cibil_score", "income_annum")]
str(data_kmeans)
## 'data.frame':    4241 obs. of  4 variables:
##  $ loan_amount         : int  29900000 12200000 29700000 30700000 24200000 13500000 33000000 15000000 2200000 4300000 ...
##  $ combined_asset_value: num  50700000 17000000 57700000 52700000 55000000 33900000 70800000 36700000 5500000 9500000 ...
##  $ cibil_score         : int  778 417 506 467 382 319 678 382 782 388 ...
##  $ income_annum        : int  9600000 4100000 9100000 8200000 9800000 4800000 8700000 5700000 800000 1100000 ...

Scaling data

data_kmeans_scaled <- scale(data_kmeans)

Deciding on optimal number of clusters using elbow plot

wss <- sapply(1:10, function(k){
  kmeans(data_kmeans_scaled, k, nstart = 20)$tot.withinss
})

plot(1:10, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")

The elbow plot indicates a distinct inflection point at k = 2, which supports selecting two clusters for the analysis.

Run k-means with k=2

set.seed(123) 
k <- 2

km <- kmeans(data_kmeans_scaled, centers = k, nstart = 25)

Calculating silhouette score of the cluster

library(cluster)

sil <- silhouette(km$cluster, dist(data_kmeans_scaled))
mean(sil[, 3])  
## [1] 0.423

The silhouette score of 0.423 indicates that the two borrower clusters are moderately well separated: the clustering has captured meaningful differences in income, assets, loan amount, and CIBIL score. While the groups are not perfectly distinct, the separation is strong enough to provide useful insight into different borrower profiles.

Creating cluster profiles

Attaching cluster labels to both the scaled and original datasets, then calculating the average unscaled feature values for each cluster to create interpretable cluster profiles.

# attach cluster labels to the scaled data
df_scaled <- as.data.frame(data_kmeans_scaled)
df_scaled$cluster <- factor(km$cluster)

# -----------------------
# 1. Add cluster labels to ORIGINAL data
# -----------------------
df_unscaled <- data_kmeans      # original (unscaled) data
df_unscaled$cluster <- factor(km$cluster)

# -----------------------
# 2. Compute cluster profiles using unscaled values
# -----------------------
cluster_profiles_unscaled <- df_unscaled %>%
  group_by(cluster) %>%
  summarise(
    loan_amount = mean(loan_amount),
    cibil_score = mean(cibil_score),
    income_annum = mean(income_annum),
    combined_asset_value = mean(combined_asset_value)
  )

print(cluster_profiles_unscaled)
## # A tibble: 2 × 5
##   cluster loan_amount cibil_score income_annum combined_asset_value
##   <fct>         <dbl>       <dbl>        <dbl>                <dbl>
## 1 1         22662709.        594.     7510627.            48803207.
## 2 2          7913197.        606.     2709201.            17009944.

Cluster interpretation

  • Cluster 1: High-value borrowers: This cluster represents financially strong borrowers who request very large loans and possess significant asset holdings and high annual income. Despite having slightly lower CIBIL scores on average, their high net worth and strong cash flow make them lower risk from a collateral perspective.

  • Cluster 2: Moderate-value borrowers: This cluster consists of moderate-income, moderate-asset borrowers who apply for smaller loans and have slightly better credit scores. Their financial profile is less substantial than Cluster 1, but their stronger CIBIL scores might indicate better repayment discipline.

These differences suggest that the clustering effectively separates borrowers based on financial strength and borrowing patterns.

SVM classification to validate cluster separability

library(caret)
library(e1071)

# Combine original features with cluster labels
cluster_data <- df_unscaled   # df_unscaled already includes: loan_amount, combined_asset_value, cibil_score, income_annum, cluster

set.seed(123)

# 80/20 train-test split (stratified by cluster)
index <- createDataPartition(cluster_data$cluster, p = 0.8, list = FALSE)
train_data <- cluster_data[index, ]
test_data  <- cluster_data[-index, ]

table(cluster_data$cluster)    # Check class balance
## 
##    1    2 
## 2089 2152
svm_model <- svm(
  cluster ~ loan_amount + combined_asset_value + cibil_score + income_annum,
  data = train_data,
  kernel = "radial",
  class.weights = table(cluster_data$cluster) / nrow(cluster_data)
)

predictions <- predict(svm_model, newdata = test_data)

confusion_matrix <- table(predictions, test_data$cluster)
print(confusion_matrix)
##            
## predictions   1   2
##           1 415   1
##           2   2 429
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", accuracy))
## [1] "Accuracy: 0.996458087367178"

We trained a Support Vector Machine to classify observations into the two k-means clusters. The model achieved an accuracy of 99.65%, with only 3 misclassifications out of 847 test observations. The confusion matrix shows very strong separation between the clusters, indicating that the cluster structure found by k-means is highly stable and predictable from the input features. This validates that the clusters represent genuinely distinct segments rather than random partitions.

Calinski-Harabasz Index (cluster internal assessment)

library(fpc)
# calculate CH index
ch <- calinhara(data_kmeans_scaled, km$cluster, cn = 2)
cat("Calinski-Harabasz Index:", ch, "\n")
## Calinski-Harabasz Index: 4531
# compare CH index under different k values
# (note: the CH index is undefined for k = 1, so we start at k = 2,
#  and each iteration must use its own clustering, km_temp$cluster)
ch_values <- sapply(2:10, function(k) {
  km_temp <- kmeans(data_kmeans_scaled, k, nstart = 20)
  calinhara(data_kmeans_scaled, km_temp$cluster, cn = k)
})
plot(2:10, ch_values, type = "b", xlab = "k", ylab = "CH Index",
     main = "CH Index under different K values")

We calculate the CH index for this k-means model. The principle is the ratio of between-cluster variance to within-cluster variance: the larger the value, the better separated the clusters and the more compact each cluster internally. The result here is 4531, a large value, indicating that the clustering result is quite good. We then compared the CH index under different k values. As the graph shows, the CH index is largest at k = 2, which once again confirms the elbow method's result: two clusters is the most suitable choice.
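The ratio behind the CH index can be illustrated with a tiny manual computation on hypothetical one-dimensional data (toy values, not from the dataset):

```r
# manual Calinski-Harabasz index on toy 1-D data with two clusters
x  <- c(1, 2, 3, 10, 11, 12)
cl <- c(1, 1, 1, 2, 2, 2)
n  <- length(x); k <- 2
grand   <- mean(x)                         # grand mean = 6.5
centers <- tapply(x, cl, mean)             # per-cluster means: 2 and 11
sizes   <- tapply(x, cl, length)
bss <- sum(sizes * (centers - grand)^2)    # between-cluster sum of squares
wss <- sum((x - centers[cl])^2)            # within-cluster sum of squares
ch  <- (bss / (k - 1)) / (wss / (n - k))   # large value = well-separated clusters
ch                                         # -> 121.5
```

Well-separated, compact clusters inflate the numerator and shrink the denominator, which is why larger CH values indicate better clusterings.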

Bootstrap sampling verification (cluster stability assessment)

library(fpc)
set.seed(123)
stab <- clusterboot(data_kmeans_scaled, 
                    clustermethod = kmeansCBI, 
                    k = 2, 
                    runs = 100, 
                    seed = 123,
                    count = FALSE)  # suppress the per-bootstrap progress printout
print(stab$bootmean)
## [1] 0.994 0.995
print(stab$bootbrd)  
## [1] 0 0

In the bootstrap sampling verification, this model achieved excellent results, with bootmean = 0.994, 0.995 and bootbrd = 0, 0. This indicates that the k-means model has extremely strong clustering stability. The bootmean values are extremely close to 1, meaning the two clusters are consistently reproduced in over 99% of the bootstrap samples, with almost no "false clusters" caused by sampling randomness. bootbrd counts how many bootstrap runs "dissolved" each cluster; the smaller the value, the more stable the cluster, and this model reaches the theoretical minimum of 0 for both. These two clusters are genuine structures in the data, not an arbitrary division imposed by the clustering algorithm.

For a better understanding, here is a PCA dimensionality-reduction visualization of the clustering result.

cluster_factor <- factor(km$cluster, levels = c(1, 2), labels = c("Cluster 1", "Cluster 2"))
library(FactoMineR)
library(factoextra)
library(ggplot2)
pca <- PCA(data_kmeans_scaled, graph = FALSE)
fviz_pca_ind(
  pca, 
  geom.ind = "point", 
  col.ind = cluster_factor,
  palette = c("#2E9FDF", "#E7B800"),
  legend.title = "Cluster",
  title = "PCA Visualization Result",
  xlab = "PC1", 
  ylab = "PC2",
  repel = TRUE
) + 
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

We can see that the two clusters are almost completely separated along the PC1 axis, with only a very small amount of overlap near PC1 = 0.

Parallel Coordinate Plot (visual verification)

library(GGally)
# data.frame() keeps the factor; cbind() on a matrix would coerce it to numeric
plot_data <- data.frame(data_kmeans_scaled, cluster = factor(km$cluster))
ggparcoord(plot_data, 
           columns = 1:4, 
           groupColumn = "cluster",
           scale = "globalminmax",
           alphaLines = 0.5) +
  labs(title = "Cluster distribution of each feature") +
  theme_minimal()

This parallel coordinate plot illustrates the distribution differences of the two clusters in terms of the four standardized features. Cluster 1 has dark tones, while Cluster 2 has light tones. Firstly, the dark and light color areas in the figure hardly overlap, indicating that the characteristic trajectories of the two clusters are significantly different. Secondly, this graph visually verifies the characteristics of the two clusters previously separated. The lines of the high-value cluster are concentrated in the area of “high loan/asset/income and low CIBIL score”, while the lines of the low-value cluster are concentrated in the area of “low loan/asset/income and high CIBIL score”.

About t-tests and ANOVA

We consider t-tests and ANOVA of little value here: since k-means itself formed the groups by separating observations on these same features, testing whether the clusters differ on those features is circular, and the results will almost certainly be significant.
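This circularity is easy to demonstrate: even on pure noise with no real groups, k-means manufactures clusters that a t-test then declares "significantly" different (a hypothetical simulation, separate from the loan analysis):

```r
# k-means on pure noise, followed by a t-test on the same feature
set.seed(1)
x  <- rnorm(200)                       # one feature, no true group structure
cl <- kmeans(x, centers = 2)$cluster   # k-means splits the noise at a threshold
t.test(x ~ cl)$p.value                 # tiny p-value despite no real groups
```

Because k-means assigns points to whichever center is nearer, the two "groups" are separated by construction, so the tiny p-value carries no evidence of real structure.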

Q: Which set of predictors—1, 2, or 3—provides the optimal balance of predictive power and model complexity for determining loan approval?

The three variables we chose are CIBIL score, loan amount, and loan term. These three are important because they are central to credit risk assessment.

Null hypothesis (H₀): All three models are equally effective, or the simplest model is best once the penalty for adding more variables is factored in (AIC₁ ≈ AIC₂ ≈ AIC₃).
Alternative hypothesis (H₁): The model with the minimum AIC is the best, demonstrating a statistically superior balance of fit and simplicity compared to the other two.

First let us run Chi-squared tests to determine dependence of variables.

chisq.test(loan_df$loan_status, loan_df$cibil_score)
## 
##  Pearson's Chi-squared test
## 
## data:  loan_df$loan_status and loan_df$cibil_score
## X-squared = 3597, df = 600, p-value <2e-16
chisq.test(loan_df$loan_status, loan_df$loan_amount)
## 
##  Pearson's Chi-squared test
## 
## data:  loan_df$loan_status and loan_df$loan_amount
## X-squared = 344, df = 377, p-value = 0.9
chisq.test(loan_df$loan_status, loan_df$loan_term)
## 
##  Pearson's Chi-squared test
## 
## data:  loan_df$loan_status and loan_df$loan_term
## X-squared = 150, df = 9, p-value <2e-16

These tests check whether two variables are significantly associated. The Chi-squared tests on the three variables (CIBIL score, loan amount, and loan term) indicate that CIBIL score and loan term both have p-values below the standard significance level of .05, both being < 2e-16. We therefore reject the null hypotheses that CIBIL score and loan term are independent of loan status: the CIBIL score is significantly associated with the loan decision, and so is the loan term. For loan amount, we fail to reject the null hypothesis, as its p-value (0.9) is greater than .05, meaning loan amount and loan approval are independent of each other.

library(regclass)
library(ResourceSelection)

LogitM1 <- glm(loan_status ~ cibil_score, data = loan_df, family = "binomial")
summary(LogitM1)
## 
## Call:
## glm(formula = loan_status ~ cibil_score, family = "binomial", 
##     data = loan_df)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 11.452352   0.373349    30.7   <2e-16 ***
## cibil_score -0.021735   0.000698   -31.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5622.1  on 4240  degrees of freedom
## Residual deviance: 2122.0  on 4239  degrees of freedom
## AIC: 2126
## 
## Number of Fisher Scoring iterations: 7
#evaluation step - confidence interval
confint.default(LogitM1)
##               2.5 %  97.5 %
## (Intercept) 10.7206 12.1841
## cibil_score -0.0231 -0.0204
#evaluation step - confusion matrix
confusion_matrix(LogitM1)
##                  Predicted  Approved Predicted  Rejected Total
## Actual  Approved                2473                 167  2640
## Actual  Rejected                 182                1419  1601
## Total                           2655                1586  4241
#evaluation step - Hosmer and Lemeshow
hoslem.test(as.numeric(loan_df$loan_status) - 1, fitted(LogitM1))
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  as.numeric(loan_df$loan_status) - 1, fitted(LogitM1)
## X-squared = 623, df = 8, p-value <2e-16
#evaluation step - McFadden R^2
null_tLogit <- glm(loan_status ~ 1, data = loan_df, family = "binomial")
mcFadden = 1 - logLik(LogitM1) / logLik(null_tLogit)
cat("McFadden R-squared: ", format(mcFadden, digits=3), "\n")
## McFadden R-squared:  0.623
LogitM2 <- glm(loan_status ~ cibil_score + loan_term, data = loan_df, family = "binomial")
summary(LogitM2)
## 
## Call:
## glm(formula = loan_status ~ cibil_score + loan_term, family = "binomial", 
##     data = loan_df)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 11.136427   0.396392    28.1   <2e-16 ***
## cibil_score -0.024212   0.000811   -29.9   <2e-16 ***
## loan_term    0.148309   0.011223    13.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5622.1  on 4240  degrees of freedom
## Residual deviance: 1919.0  on 4238  degrees of freedom
## AIC: 1925
## 
## Number of Fisher Scoring iterations: 7
#evaluation step - confidence interval
confint.default(LogitM2)
##               2.5 %  97.5 %
## (Intercept) 10.3595 11.9133
## cibil_score -0.0258 -0.0226
## loan_term    0.1263  0.1703
#evaluation step - confusion matrix
confusion_matrix(LogitM2)
##                  Predicted  Approved Predicted  Rejected Total
## Actual  Approved                2466                 174  2640
## Actual  Rejected                 183                1418  1601
## Total                           2649                1592  4241
#evaluation step - Hosmer and Lemeshow
hoslem.test(as.numeric(loan_df$loan_status) - 1, fitted(LogitM2))
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  as.numeric(loan_df$loan_status) - 1, fitted(LogitM2)
## X-squared = 182, df = 8, p-value <2e-16
#evaluation step - McFadden R^2
null_tLogit <- glm(loan_status ~ 1, data = loan_df, family = "binomial")
mcFadden = 1 - logLik(LogitM2) / logLik(null_tLogit)
cat("McFadden R-squared: ", format(mcFadden, digits=3), "\n")
## McFadden R-squared:  0.659
LogitM3 <- glm(loan_status ~ cibil_score + loan_term + loan_amount, data = loan_df, family = "binomial")
summary(LogitM3)
## 
## Call:
## glm(formula = loan_status ~ cibil_score + loan_term + loan_amount, 
##     family = "binomial", data = loan_df)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.14e+01   4.17e-01   27.36   <2e-16 ***
## cibil_score -2.43e-02   8.14e-04  -29.82   <2e-16 ***
## loan_term    1.49e-01   1.12e-02   13.22   <2e-16 ***
## loan_amount -1.53e-08   6.43e-09   -2.38    0.017 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5622.1  on 4240  degrees of freedom
## Residual deviance: 1913.3  on 4237  degrees of freedom
## AIC: 1921
## 
## Number of Fisher Scoring iterations: 7
#evaluation step - confidence interval
confint.default(LogitM3)
##                 2.5 %    97.5 %
## (Intercept)  1.06e+01  1.22e+01
## cibil_score -2.59e-02 -2.27e-02
## loan_term    1.26e-01  1.71e-01
## loan_amount -2.79e-08 -2.72e-09
#evaluation step - confusion matrix
confusion_matrix(LogitM3)
##                  Predicted  Approved Predicted  Rejected Total
## Actual  Approved                2466                 174  2640
## Actual  Rejected                 181                1420  1601
## Total                           2647                1594  4241
#evaluation step - Hosmer and Lemeshow
hoslem.test(as.numeric(loan_df$loan_status) - 1, fitted(LogitM3))
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  as.numeric(loan_df$loan_status) - 1, fitted(LogitM3)
## X-squared = 181, df = 8, p-value <2e-16
#evaluation step - McFadden R^2
null_tLogit <- glm(loan_status ~ 1, data = loan_df, family = "binomial")
mcFadden = 1 - logLik(LogitM3) / logLik(null_tLogit)
cat("McFadden R-squared: ", format(mcFadden, digits=3), "\n")
## McFadden R-squared:  0.66

All three models are highly significant predictors of loan approval, with all coefficients having p-values below the standard significance level of .05. Model 1 uses CIBIL score only. It is highly significant and explains a substantial portion of the variance, with a McFadden R² of 0.623, but its AIC of 2126 makes it the least efficient model. Model 2 is Model 1 with the addition of loan term. It shows an improved fit: the AIC drops to 1925 and the R² rises to 0.659. This model confirms that both CIBIL score and loan term are crucial, highly significant predictors; these two variables capture most of the available predictive power for loan approval. Model 3 is Model 2 with the addition of loan amount. It achieves the best fit, with the lowest AIC (1921) and the highest R² (0.66), and all three variables are statistically significant: the p-value for loan amount is 0.017, so we reject the null hypothesis that it is a non-significant variable and conclude that every variable contributes statistically unique and valuable information to the model's fit.

To answer the posed question, we reject the null hypothesis and state that Model 3, with the three predictors CIBIL score, loan term, and loan amount, is the best model: its improvement in fit outweighs the penalty for adding more parameters. Since Model 3 has the highest McFadden R², the lowest AIC, and all variables significant, it accounts for a slightly greater proportion of variability in loan approval status than the other models.
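The AIC comparison itself can be done in one call with AIC(). The sketch below uses simulated loan-style data with assumed coefficient values (hypothetical, for illustration), since the fitted models above depend on loan_df; on the real data the comparison is simply AIC(LogitM1, LogitM2, LogitM3).

```r
# comparing nested logistic models by AIC on simulated loan-style data
set.seed(42)
n      <- 1000
cibil  <- runif(n, 300, 900)                 # simulated CIBIL scores
term   <- sample(2:20, n, replace = TRUE)    # simulated loan terms (years)
amount <- runif(n, 1e6, 4e7)                 # simulated loan amounts
p      <- plogis(11.5 - 0.022 * cibil + 0.15 * term)  # assumed true model
status <- rbinom(n, 1, p)                    # simulated rejection indicator

m1 <- glm(status ~ cibil,                 family = binomial)
m2 <- glm(status ~ cibil + term,          family = binomial)
m3 <- glm(status ~ cibil + term + amount, family = binomial)
AIC(m1, m2, m3)  # lower AIC = better balance of fit and complexity
```

Here term carries real signal, so m2 should beat m1 on AIC, while amount is pure noise, so m3's extra parameter should cost about 2 AIC points, mirroring the small gap seen between the real models.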

References

Mnkandla, A. Z., Ndlovu, B. M., Dube, S., Nyoni, P., & Kiwa, F. J. (2024). Loan eligibility system using machine learning. Proceedings of the 7th European Conference on Industrial Engineering and Operations Management. https://doi.org/10.46254/EU07.20240079
Sharma, A. (2023). Loan approval prediction dataset. https://www.kaggle.com/datasets/architsharma01/loan-approval-prediction-dataset/data